## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
The dataset contains more than 100,000 loans’t data with 81 features for each loan from Prosper Marketplace Loan company. The dictionary described what does each variable represents. The variables that worth investigating was picked by my limited knowledge in finance. However, later in the analysis, more variables might be added in analysis according to correlations between variables.
## 'data.frame': 113937 obs. of 22 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ TotalCreditLinespast7years: int 12 29 3 29 49 49 20 10 32 32 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## [1] "ListingKey" "ListingCreationDate"
## [3] "Term" "LoanStatus"
## [5] "CreditGrade" "ListingCategory..numeric."
## [7] "ProsperScore" "BorrowerState"
## [9] "Occupation" "EmploymentStatusDuration"
## [11] "IsBorrowerHomeowner" "CreditScoreRangeLower"
## [13] "CreditScoreRangeUpper" "IncomeRange"
## [15] "IncomeVerifiable" "TotalCreditLinespast7years"
## [17] "TotalProsperLoans" "LoanOriginationDate"
## [19] "LoanOriginalAmount" "PercentFunded"
## [21] "Investors" "StatedMonthlyIncome"
20 variables and 113937 observation are picked for analyzing.
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 23 6213 11557 11263 2206 5530 11442 19556 35413 10734
Geographical information is critical for the growth of a company. Firstly, analyze what state owns the most of loans on Proser Marketplace. However, since the population density is vary from state to state, this result may be less informative than people normally think.
##
## AK AL AR AZ CA CO CT DC DE FL GA
## 5515 200 1679 855 1901 14717 2210 1627 382 300 6720 5008
## HI IA ID IL IN KS KY LA MA MD ME MI
## 409 186 599 5921 2078 1062 983 954 2242 2821 101 3593
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 2318 2615 787 330 3084 52 674 551 3097 472 1090 6729
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4197 971 1817 2972 435 1122 189 1737 6842 877 3278 207
## WA WI WV WY
## 3048 1842 391 150
From the bar plot above, we can see that California has most total loans. Meanwhile, the total number of loans distributed similarly as population. One interesting find is there are 5515 loans has unkonwn state, are they from out of United States or it is data entry error? It is hard to allocate these loans to a specific states without additional data from Prosper Marketplace.
Will loaner’s occupation affects there chance of getting investors? Is there any occupation discriminate on Prosper MarketPlace? To better answer these question, total amount of loans for different occupations are analyzed first.
Obviously, the dropdown menu on Prosper Marketplaces’s website is not exhaustive. The occupation “Other” siginificantly skewed the distribution.
## [1] 136 127 0 39 11 33 42 422 284 166 3 94 10 18 6
As shown in the graph above, most of the loaners provided employement data on this website do have a fairly short employment status history.
The credit score records is dvided into two features: 1. CreditScoreRangeLower 2. CreditScoreRangeUpper Firstly, the relationship between this two features is determined.
##
## 19
## 113346
##
## 19 379 439 459 479 499 519 539 559 579 599 619
## 133 1 5 36 141 346 554 1593 1474 1357 1125 3602
## 639 659 679 699 719 739 759 779 799 819 839 859
## 4172 12199 16366 16492 15471 12923 9267 6606 4624 2644 1409 567
## 879 899
## 212 27
As show above, the the creditscorerangeupper is mostly 19 more than the creditscorerangelower. According to the properties of expected values. Only one of these variables needs to analyzed.
As shown above, the distribution of credit score shows approximate normal distribution if we neglect outliers at 19.
##
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
As show in the graph above, the income range of lenders is a good representation of financial state of lenders. Quite a number of incomes is not displayed. But mostly the income range of lenders is concentrated below $50,000.
The purpose of analyzing loan origination date is verify of the data
##
## 4000 15000 10000 5000 2000 3000 25000 20000 1000 2500 7500 7000
## 14333 12407 11106 6990 6067 5749 3630 3291 3206 2992 2975 2949
## 6000 3500 8000 12000 9000 13000 1500 4500
## 2869 2567 2442 1921 1695 1509 1507 1406
Most of the loan amount is on numbers that looks more “integer” like 4000, 15000 insteand 15232 or 4355. Because the Prosper MarketPlace doesn’t gurantee this amount will be fully funded, its make sense that most lenders put a approximate number to take the most from investers.
There are multiple status of loans per data dictionary:
Cancelled Chargedoff Completed Current Defaulted FinalPaymentInProgress Past due(Followed by buckets)
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
As seen in this graph, the mojority of loans is in their “Current” Status. Becuase the loan status is also the main feature of our analysis, we need to go one step further on.
The loans status feature has 12 levels:
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
To make the analsis easier in later stage, these levels will be divided into 2 groups: Good loans and Bad loans, in the other word, the deliquency state of loans. After read the wikipedia and investopedia page of these words, they are divided as follow:
A new column called “DeliquencyState” is created based on the creteria above. Under this column, 1 represent loans with deliquencies and 0 represents loans without deliquencies.
##
## delinquent GoodStanding
## 18006 95931
At the beginning of univariate analysis
The main features of interest in my analysis are the status of the loan and the number of investors.
There are multiple status of loans per data dictionary:
Cancelled Chargedoff Completed Current Defaulted FinalPaymentInProgress Past due(Followed by buckets) …
They are divided and regrouped into a new feature: DeliquencyState which is now the feature of interest.
‘ListingKey’ ‘ListingCreationDate’ ‘Term’ ‘LoanStatus’ ‘CreditGrade’ ‘ListingCategory..numeric.’ ‘ProsperScore’ ‘BorrowerState’ ‘Occupation’ ‘EmploymentStatusDuration’ ‘IsBorrowerHomeowner’ ‘CreditScoreRangeLower’ ‘CreditScoreRangeUpper’ ‘IncomeRange’ ‘IncomeVerifiable’ ‘TotalCreditLinespast7years’ ‘TotalProsperLoans’ ‘LoanOriginationDate’ ‘LoanOriginalAmount’ ‘PercentFunded’ ‘Investors’ ‘ListingCreationYear’ ‘creditrange’ ‘LoanOriginationYear’ ‘DeliquencyState’
Yes, I created two variables to help my analysis: 1. LoanOriginationYear: this variable is parsed from LoanOriginationDate data in order to determine if the sample are radomly picked through years.
I changed the data type of LoanOriginationDate to datetime so I can parse the year of loan origination much easier.
This section is divided into 3 parts: 1. Relationship between loan status and support features. 2. Relationship between number of investors and supporting features 3. Relationship between some supporting features
Although it is obvious that borrowers that has income range from 25,000 to 49,999 has most of the delinquent loans, they also has most of loans in general. A more reasonable feature in this kind case would be the percent of deliquent loans compare to the total loans in that income range.
In contraray to the intuition, delinquent loans has lower loan amount compare to good standing loans. I was expecting delinquent has higher loan amount which makes them harder to payoff.
The boxplot, in this case, does not make too much sense because of significant amount of data points that out of quantiles from the mean. Because of that, I used the point plot to have visulize the employement duration.
Because all listing categories are in numeric forms, it is necessary to map the english descriptvie name to them according to the data dictionary.
Will higher credit score attract more investors? A scatter plot with trend line might be able answer this question.
Since it is barely possible for a person has credit score of 0, I only plotted loans with credit score above 0. Although the relationship between credit score and is approximately position, it is noticeable that there is a downward trend after credit score go over 780.
Interestingly, weather the income verifiable does not have obvious influence on the number of investors.
From the correlation plot above, we found fairly strong correlations between Investors and loan amount, credit score and prosper score. These three variables will be explored deeper in next plot. However, it is also necessary to keep in mind that ProsperScore have high correlations of .4 with creditrange.
For the deliquency of loans, the duration of employement and credit score seems have most obvious relationship with the deliquency state. Longer employement history and higher the credit score seems relevant to lesser deliquent loans.
As of the number of investors, the credit score and loan categories both have effects. The credit score has much more significant effects on the number compare to loan categories. The number of investors increased as the credit score goes up.
The debt consolidation has highest median for loan amount amoung all other categories. Medical and dental loans does not own a high loan amount. The reason might be the residents of United States heavily rely on insurance and law service for health coverage.
The strongest I found is the relationship between credit score and the number of investor.
##
## Call:
## lm(formula = Investors ~ LoanOriginalAmount + CreditScoreRangeUpper,
## data = loan_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -266.38 -59.78 -7.54 40.92 1027.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.405e+02 3.063e+00 -45.86 <2e-16 ***
## LoanOriginalAmount 5.368e-03 4.773e-05 112.46 <2e-16 ***
## CreditScoreRangeUpper 2.502e-01 4.489e-03 55.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.42 on 113343 degrees of freedom
## (591 observations deleted due to missingness)
## Multiple R-squared: 0.1666, Adjusted R-squared: 0.1666
## F-statistic: 1.133e+04 on 2 and 113343 DF, p-value: < 2.2e-16
The amateur regression model here does not represent the actual relationship between scores, loan amount and the number invester. Instead, p value less than 0.05 shows credit scores and loan amount do have statistically significant impact on the number of investers. From the Q-Q plot, it can ben seen that the distribution of investors is not normal, which is also shwon in the univariate analysis. From the Leverage plot, there’s a outlier located on the left of the graph. T his group of points indicate that there are mulitple outliers while doing the regression. From the residual plot, it can be seen that as the number of investors goes large, its become harder and harder to predict the number of investors.
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
Relationship between number of investors and credit scores, derived in bivariate analysis are used in conjuntion with the original loan amount in multivariate. The result is significant, these two variables can be used together to built a regression model to predict the number of investors.
Seems the prosper score, which is the socre assigned by prosper loan marketplace has strong correlation with the credit score assigned by credit union. This is either because the prosper loan market place assign their score based on credit score from credit union or people’s bevaior on the prosper market place reflect their general credit behavior.
strengths and limitations of your model.
I tried to build a model based as: Investors ~ LoanOriginalAmount + CreditScoreRangeUpper. However, some spricification for linear regression is break, since the purpose is not to build a successful but to identify the general trend between variables, I didn’t dived further onto this model.
Even the deliquency status of loans in significant impacted by the credit score of borrowers, the monthly income also plays a role in the behavior of deliquency status. For this plot, it is obvious that lower credit score implies higher chance of deliquent loans. Borrowers with higher monthly income will have higher credit score and higher chance of pay out their loans.
From the Q-Q plot, it can ben seen that the distribution of investors is not normal, which is also shwon in the univariate analysis. From the Leverage plot, there’s a outlier located on the left of the graph. This group of points indicate that there are mulitple outliers while doing the regression. From the residual plot, it can be seen that as the number of investors goes large, its become harder and harder to predict the number of investors.
From this plot, we got a interesting finding that loans with higher loan original amount owns bettwe chance of good standing at the end. There couple theories for this behaviors. Firstly, this might because these loan has longer payment cycle so that they are still going through the pay back cycle. It coule also because borrowers borrowed a lot from loan companies tend to making better judgement of their financial situation.
The Prosper Marketplace is the first P2P loan marketplace in the United States and still operating today. This data set from Prosper Market place contains 81 variables and 113,937 observations. I picked about 20 features for analyzing.
Started from univariate analysis, I tried to understand the distribution of indivisual variables by using bar-plots and hitograms. I was expecting find much normal behaved variables but suprised by the log-normal and reversed lognormal behaviors found in these variables. For example, both the loan original amount and total credit line in past 7 years showing log normal behaviors. The number of investors ia also showing log-normal distribution in some sense.
I was struggled while doing linear regresssion about the number of investors. Specificly, although the number of investors, credit score and original loan amount are positively correlated, all specifications for linera regression are failed to hold. However, the dianosis plot successfully dipicted issues with the regression and helped the analysis. The ANOVA table indicated that both the credit score and the original loan amount is significant when predicting the number of investors.
Because of the good balance between categorical variables and numerical variables as well as the abundantness of information. I think this dataset would be a good resource for machine learning and neural network. I belive that after encode all categorical variables and apply machine learning algorithm, a impressive result is guaranteed.
https://yihui.name/knitr/options/
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
https://www.statmethods.net/stats/regression.html
https://stackoverflow.com/questions/41205489/different-fill-values-for-different-factors-in-ggplot2